yannick

Convert PDF to text with pdftotext (command line)

pdftotext is a command line utility that converts PDF files to plain text. It has many options, including the ability to specify the page range to convert, maintain the original physical layout of the text as best as possible, set line endings (unix, dos or mac), and even work with password-protected PDF files.

pdftotextis part of the poppler / poppler-utils / poppler-tools package (depending on the Linux distribution you're using). Install this package as follows:

•.Debian, Ubuntu, Linux Mint, and other Debian/Ubuntu-based Linux distributions:

sudo apt install poppler-utils

•.Fedora:

sudo dnf install poppler-utils

•.openSUSE:

sudo zypper install poppler-tools

•.Arch Linux:

sudo pacman -S poppler

In other Linux distributions use your package manager to install the poppler / poppler-utils package.

Now that the package is installed, you can convert a PDF file to plain text and preserve its layout (I recommend using this -layout option for maintaining the original physical layout, but you can try it without it too) with:

pdftotext -layout input.pdf output.txt

You'll need to replace input.pdf with the name of the PDF file, and output.txt with the name you want the generated TXT file to be called. Also add the paths before filenames if needed (e.g. ~/Documents/mypdf.pdf). If no output text file is specified, pdftotext will name the file with the same file name as the original PDF file.

The layout option preserves the PDF layout when converting it to text, even if multi-column PDF cases.

What if you want to only convert a page range of the PDF to text, instead of the whole PDF file? Use -f (first page to convert) and -l (last page to convert) followed by the page number, like this:

pdftotext -layout -f M -l N input.pdf

Replace M and N with the first and last page number to extract, and input.pdf with the PDF filename.

Want to use mac, dos or unix end-of-line characters? You can specify that too, using -eol followed by mac, dos or unix. E.g. for unix line endings:

pdftotext -layout -eol unix input.pdf

If you don't want to insert page breaks between pages, append -nopgbrk:

pdftotext -layout nopgbrk input.pdf

Want to batch convert all PDF files from a folder to text files? pdftotext doesn't support batch PDF to text conversion (and pdftotext *.pdf doesn't work), but you can convert all the PDF files in a folder to text files by using a Bash FOR loop:

for file in *.pdf; do pdftotext -layout "$file"; done

For more options, run man pdftotext and pdftotext --help.